Existing probabilistic scanners and parsers impose hard constraints on the way lexical and
syntactic ambiguities can be resolved. Furthermore, traditional grammar-based parsing tools are
limited in the mechanisms they allow for taking context into account. In this paper, we
propose a model-driven tool that allows for statistical language models with arbitrary probability
estimators. Our work on model-driven probabilistic parsing is built on top of ModelCC, a
model-based parser generator, and enables the probabilistic interpretation and resolution of
anaphoric, cataphoric, and recursive references in the disambiguation of abstract syntax graphs.
In order to demonstrate the expressive power of ModelCC, we describe the design of a
general-purpose natural language parser.



I. INTRODUCTION 

Natural languages suffer from lexical ambiguities and
syntactic ambiguities. Lexical ambiguities [?] occur
when a lexeme has several meanings [?]. Syntactic
ambiguities occur when a token sequence can be generated
using more than one parse tree [1].

A common approach to disambiguation consists of per- 
forming probabilistic scanning (i.e. probabilistic lexical 
analysis) and probabilistic parsing (i.e. probabilistic syn- 
tactic analysis), which assign a probability to each pos- 
sible parse tree. However, existing techniques for proba- 
bilistic scanning and parsing present several drawbacks: 
probabilistic scanners may produce incorrect sequences of 
tokens due to wrong guesses or to occurrences of words 
that are not in the lexicon, and probabilistic parsers 
cannot consider relevant context information such as re- 
solved references between language elements. 

Model-based language specification techniques [1]
decouple language design from language processing.
ModelCC [?, ?] is a model-based parser generator that
includes support for dealing with references between
language elements. Thus, instead of returning mere
abstract syntax trees, ModelCC is able to obtain abstract
syntax graphs and to handle lexical and syntactic
ambiguities.

In this paper, we explain how ModelCC supports
probabilistic language models. Section II provides an
introduction to probabilistic parsing techniques and to the
model-based language specification techniques employed
by the ModelCC parser generator. Section III explains
the probabilistic model support in ModelCC. Section IV
presents a case study that illustrates the approach: the
specification of a general-purpose natural language
parser. Finally, Section V presents our conclusions and
pointers for future work.



II. BACKGROUND 

In this section, we provide an analysis of the state of 
the art on probabilistic parsing and on model-based lan- 
guage specification. 



A. Probabilistic Parsing 

There are many approaches to part-of-speech tagging
using probabilistic scanners and to language
disambiguation using probabilistic parsers.

Probabilistic scanners based on Markov-like models [?]
consider the existence of implicit relationships between 
words, symbols or characters found close in sequences, 
and irrevocably guess the type of a lexeme based on the 
preceding ones. When using such techniques, a single 
wrong guess renders the whole parsing procedure irre- 
mediably erroneous, as no correct parse tree that uses a 
wrong token can be found. 

Probabilistic scanners based on lexicons [?] assign
probabilities to a lexeme belonging to different word 
classes from the statistical analysis of lexicons. Scan- 
ning a lexeme that belongs to a particular word class but 
never belonged to that class in the training lexicon pro- 
vides wrong scanning results, which, in turn, render the 
whole parsing procedure useless. 
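
The failure mode of lexicon-based scanners can be illustrated with a minimal sketch. It assumes a toy training lexicon of (word, word class) observations and a plain relative-frequency estimator; the names `observations`, `lexicon_counts`, and `tag_probabilities` are hypothetical and do not correspond to any particular tagger.

```python
from collections import Counter, defaultdict

# Hypothetical training lexicon: (word, word class) observations.
observations = [
    ("saw", "VERB"), ("saw", "VERB"), ("saw", "NOUN"),
    ("picture", "NOUN"), ("picture", "NOUN"),
]

# Count how often each word was seen with each word class.
lexicon_counts = defaultdict(Counter)
for word, word_class in observations:
    lexicon_counts[word][word_class] += 1


def tag_probabilities(word):
    """Relative-frequency estimate of P(class | word) from the lexicon."""
    counts = lexicon_counts.get(word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # unknown word: the lexicon offers no class at all
    return {cls: n / total for cls, n in counts.items()}


saw = tag_probabilities("saw")
picture = tag_probabilities("picture")
```

In this toy lexicon, "picture" was never observed as a verb, so an input such as "picture this" would receive no verbal reading from the estimator, which is the kind of scanning error described above.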

Probabilistic parsers [?] compute the probability of
different parse trees by considering token probabilities 
and grammar production probabilities, which are empir- 
ically obtained from the analysis of linguistic corpora. 
The probability of a symbol is defined as the product of 
the probability of the grammar rule that produced the 
symbol and the probabilities of all the symbols involved 
in the application of that rule. The probability of a parse 
tree is that of its root symbol. These techniques do not 
take context into account. 
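
The recursive definition above can be sketched as follows. This is an illustrative sketch only: the `Node` class, the function `tree_probability`, and all the numbers are invented for the example and are not taken from any corpus or parser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A parse-tree symbol: a token (no children) or a rule application."""
    symbol: str
    probability: float  # token probability, or probability of the rule applied
    children: List["Node"] = field(default_factory=list)


def tree_probability(node: Node) -> float:
    """Probability of a symbol: its own rule (or token) probability times
    the probabilities of all the symbols involved in that rule."""
    p = node.probability
    for child in node.children:
        p *= tree_probability(child)
    return p


# Toy tree with invented probabilities.
tree = Node("S", 0.9, [
    Node("NP", 0.5, [Node("I", 0.8)]),
    Node("VP", 0.4, [Node("saw", 0.6)]),
])
p_tree = tree_probability(tree)  # probability of the root symbol
```

Note that nothing in this computation looks outside the subtree being scored, which is why such parsers cannot use context.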

Probabilistic lexicalized parsers [?, ?] associate lexical
heads and head tags to the grammar symbols. Gram- 
mar rules are then decomposed and rewritten to include 
the different combinations of symbols, lexical heads, and 
head tags. Different probabilities can be associated to 
each of the new rules. When using this technique, the 
grammar significantly expands and a more extensive 
analysis of linguistic corpora is needed to produce ac- 
curate results. It should be noted that this technique is 
not able to consider relevant context information such as 
resolved references between language elements. 

Conventional probabilistic scanners and parsers do not
allow the use of arbitrary probability estimators or
statistical models that take advantage of more context
information.



B. Model-Based Language Specification

In its most general sense, a model is anything used
in any way to represent something else. In such sense,
a grammar is a model of the language it defines. The
idea behind model-based language specification is that,
starting from a single abstract syntax model (ASM) that
represents the core concepts in a language, language
designers can develop one or several concrete syntax models
(CSMs). These CSMs can suit the specific needs of the
desired textual or graphical representation for language
sentences. The ASM-CSM mapping can be performed,
for instance, by annotating the abstract syntax model
with the constraints needed to transform the elements in
the abstract syntax into their concrete representation.

A diagram summarizing the traditional language design
process is shown in Figure 1, whereas the corresponding
diagram for the model-based approach is shown
in Figure 2. It should be noted that ASMs represent
non-tree structures whenever language elements can refer to
other language elements, hence the use of the 'abstract
syntax graph' term.

Figure 1: Traditional language processing.

Figure 2: Model-based language processing.

ModelCC [?, ?] is a parser generator that supports
a model-based approach to the design of language
processing systems. Its starting ASM is created by defining
classes that represent language elements and establishing
relationships among those elements. Once the ASM
is created, constraints can be imposed over language
elements and their relationships as metadata annotations
[6] in order to produce the desired ASM-CSM mappings.

Although probabilistic language processing techniques
and model-based language specification have been
extensively studied, to the best of our knowledge, there are no
techniques that allow model-driven probabilistic parsing.
In the next section, we explain ModelCC's support for
probabilistic language models.



III. PROBABILISTIC PARSING IN MODELCC

ModelCC effectively combines model-based language
specification with probabilistic parsing by allowing the
specification of arbitrary probabilistic language models.

Subsection III.A introduces ModelCC support for
probabilistic language models and presents ModelCC's
@Probability annotation. Subsection III.B discusses the
use of contextual information in ModelCC probabilistic
parsers. Subsection III.C explains how symbol
probabilities are computed.


A. Probabilistic Language Models 

ModelCC's @Probability annotation allows the
specification of probability values for language elements and
language element members. Probability values can be
specified both for syntactic elements of the language and
for lexical components; in the latter case, the lexical
analyzer behaves like a part-of-speech tagger in
natural language processing.

Such probability values can be specified using three
alternatives: a probability value as a real number between
0 and 1, a frequency as an integer number, or a
custom probability evaluator that computes the probability
value from the analysis of the language element and its 
context. 
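
The three alternatives could be reduced to a single probability value as in the following sketch. It is not ModelCC's API: the function `resolve_probability`, the `context` dictionary, and the rule that a frequency is divided by the total frequency of the competing alternatives are all assumptions made for illustration.

```python
def resolve_probability(spec, context, total_frequency=None):
    """Turn a probability specification into a value in [0, 1].

    spec may be (assumed conventions):
      * a callable: a custom evaluator that receives the context,
      * an int:     a frequency, normalized by total_frequency,
      * a float:    a probability used as given.
    """
    if callable(spec):
        value = spec(context)
    elif isinstance(spec, int):
        if not total_frequency:
            raise ValueError("a frequency needs a positive total_frequency")
        value = spec / total_frequency
    else:
        value = float(spec)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"not a probability: {value}")
    return value


# Hypothetical custom evaluator that inspects the element's context.
def reference_evaluator(context):
    return 0.9 if context.get("is_reference") else 0.1


p_fixed = resolve_probability(0.25, {})
p_from_frequency = resolve_probability(30, {}, total_frequency=120)
p_custom = resolve_probability(reference_evaluator, {"is_reference": True})
```

Because Python distinguishes `1` from `1.0`, this sketch treats an integer as a frequency and a float as a probability; a real tool would need an explicit way to tell the two apart.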

Since ModelCC supports lexical and syntactic ambigu- 
ities, and the combination of language models, one of the 
main novelties of ModelCC with respect to existing tech- 
niques is that it allows the modular specification of prob- 
abilistic languages, that is, it is able to produce parsers 
from composite language specifications even when some 
of the language elements overlap or conflict. 

ModelCC also supports alternative models for the
representation of uncertainty (e.g. possibilistic models,
models based on Dempster-Shafer theory, or any other
soft computing models), provided that two evaluation
operators are supplied: one for language element
instances and one for the application of grammar rules.
Optionally, a casting operator that translates the
estimated value in one model into a value valid
for a different kind of model allows the specification of
modular languages even when different mechanisms for
representing natural language ambiguities are employed
for different parts of the language model.



B. Context Information

ModelCC provides context information that custom
probability evaluators and constraints can take into
account when processing a language element.

The context information includes the current syntax
graph and the parse graph symbol corresponding to the
language element being evaluated. If the language
element instance is a reference, the context information
also includes the referenced language element instance,
its corresponding parse graph symbol, and the context
graph, which is the smallest graph that contains both
the reference and the referenced object.

It should be noted that, from this information, it is 
possible to deduce traditional metrics such as the dis- 
tance between the reference and the referenced object in 
the input or in the syntax graph and whether the refer- 
ence is anaphoric, cataphoric, or recursive. 
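
As a rough illustration of how such metrics could be derived, the sketch below classifies a reference from token positions in the input. The `Span` type and both functions are hypothetical, and the mapping of the three categories to span relations (referent before, referent after, reference contained in referent) is an assumption of this sketch rather than ModelCC's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token, inclusive
    end: int    # index after the last token, exclusive


def classify_reference(reference: Span, referenced: Span) -> str:
    """Classify a reference by where its referent lies in the input."""
    if referenced.start <= reference.start and reference.end <= referenced.end:
        return "recursive"   # the reference occurs inside its own referent
    if referenced.end <= reference.start:
        return "anaphoric"   # the referent precedes the reference
    if reference.end <= referenced.start:
        return "cataphoric"  # the referent follows the reference
    return "overlapping"     # partial overlap, not covered by the three cases


def token_distance(reference: Span, referenced: Span) -> int:
    """Number of tokens between the reference and its referent (0 if they overlap)."""
    if referenced.end <= reference.start:
        return reference.start - referenced.end
    if reference.end <= referenced.start:
        return referenced.start - reference.end
    return 0


kind = classify_reference(Span(5, 6), Span(0, 2))
gap = token_distance(Span(5, 6), Span(0, 2))
```

A custom probability evaluator could, for example, lower the probability of a candidate referent as `token_distance` grows.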

However, in contrast to existing probabilistic pars- 
ing techniques, ModelCC also allows the specification of 
complex syntactic constraints, semantic constraints, and 
probability evaluators that use extensive context infor- 
mation such as resolved references between language el- 
ements. 



C. Probability Evaluation 

The probability of a particular parse graph $G$ for a
sentence $w_{1:m}$ of length $m$ is defined as the product of
the probabilities associated to the $n$ instances of language
elements $E_i$ in the parse graph $G$:

$$P(G \mid w_{1:m}) = \prod_{i=1}^{n} P(E_i) \qquad (1)$$
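
A minimal numerical sketch of Equation (1) follows. It works in log space to avoid underflow on long sentences, which is a common implementation choice and not something the paper prescribes; the function name, the candidate graphs, and their probabilities are invented for illustration.

```python
import math
from typing import List


def graph_log_probability(element_probabilities: List[float]) -> float:
    """log P(G | w) as the sum of log P(E_i), i.e. Equation (1) in log space."""
    if any(p <= 0.0 for p in element_probabilities):
        return float("-inf")  # one impossible element rules out the whole graph
    return sum(math.log(p) for p in element_probabilities)


# Two hypothetical parse graphs for the same sentence, each described by the
# probabilities of its language element instances.
candidates = {
    "graph_a": [0.9, 0.5, 0.8],
    "graph_b": [0.9, 0.7, 0.4],
}
scores = {name: graph_log_probability(ps) for name, ps in candidates.items()}
best = max(scores, key=scores.get)  # graph with the highest probability
```

Ranking candidate graphs by this score is how the product is used for disambiguation later in this section.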



Given a language element $E$ that represents a
part-of-speech tag and a word $w$, the lexical analyzer acts as a
POS tagger and provides $P(E \mid w)$.

Given a language element $E$ with $M_1, \ldots, M_n$ members in
its definition, some of which are optional, the probability
$P(E \mid M_{1:n})$ is computed as follows. Let $OPT(E)$ be
the set of optional elements for $E$. Assuming that their
appearance is statistically independent, we can estimate
the probability of $E$ given its observed elements $O_{1:k}$:

$$P(E \mid O_{1:k}) = P(E) \prod_{\substack{M_i \in OPT(E) \\ M_i \in O_{1:k}}} P(M_i) \prod_{\substack{M_j \in OPT(E) \\ M_j \notin O_{1:k}}} \bigl(1 - P(M_j)\bigr) \qquad (2)$$
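
Equation (2) can be checked on a small example. In the sketch below, the function `element_probability`, the member names, and all the probability values are hypothetical; only the formula itself comes from the text.

```python
from typing import Dict, Set


def element_probability(prior: float,
                        optional_probabilities: Dict[str, float],
                        observed: Set[str]) -> float:
    """Equation (2): P(E) times P(M) for each observed optional member
    and (1 - P(M)) for each optional member that is absent."""
    p = prior
    for member, p_member in optional_probabilities.items():
        p *= p_member if member in observed else (1.0 - p_member)
    return p


# Hypothetical nominal phrase with two optional members.
optional = {"determiner": 0.7, "complements": 0.2}

# "a picture": determiner present, no complements -> 0.5 * 0.7 * (1 - 0.2)
p_with_determiner = element_probability(0.5, optional, {"determiner"})

# "pictures": neither optional member present -> 0.5 * (1 - 0.7) * (1 - 0.2)
p_bare = element_probability(0.5, optional, set())
```

With these invented values the phrase with a determiner scores 0.28 and the bare noun 0.12, so absent optional members lower the score only as much as their own probabilities dictate.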

Given an ambiguous sentence $w_{1:m}$, its disambiguation
is done by picking the parse graph $G$ with the highest
probability for that sentence:

$$G(w_{1:m}) = \operatorname*{arg\,max}_{G} \{ P(G \mid w_{1:m}) \} \qquad (3)$$

We now proceed to present an example of natural
language specification using ModelCC.



IV. MODEL-BASED SPECIFICATION OF NATURAL 
LANGUAGES 

In this section, we present a model-based specification
for a probabilistic natural language parser. Subsection
IV.A outlines the general natural language features.
Subsection IV.B provides the ModelCC ASM specification
of the general natural language. Subsection IV.C
explains how the general natural language can be
instantiated. Subsection IV.D presents a sample English
language parser.



A. Natural Language Description 

Our general language model supports Chomsky's
X-bar theory [?], which claims that certain human
languages share structural similarities.

In our model, a sentence consists of a clause (i.e. a 
complete proposition), a clause can be either a simple 
clause or the coordinate clause composite that creates a 
compound sentence, a simple clause consists of an op- 
tional nominal phrase and a verbal phrase, and a coordi- 
nate clause composite consists of a set of clauses and an 
optional floating coordinating conjunction. 
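
Since a ModelCC ASM is built from classes and the relationships among them, the sentence and clause description above maps naturally onto a small class hierarchy. The following sketch uses Python dataclasses purely for illustration; it is not ModelCC code, and the class and field names are assumptions chosen to mirror the prose.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NominalPhrase:
    text: str


@dataclass
class VerbalPhrase:
    text: str


class Clause:
    """Abstract clause: a SimpleClause or a CoordinateClauseComposite."""


@dataclass
class SimpleClause(Clause):
    verbal_phrase: VerbalPhrase
    nominal_phrase: Optional[NominalPhrase] = None  # optional nominal phrase


@dataclass
class CoordinateClauseComposite(Clause):
    clauses: List[Clause]
    conjunction: Optional[str] = None  # optional floating coordinating conjunction


@dataclass
class Sentence:
    clause: Clause


# A compound sentence: two simple clauses joined by a conjunction.
sentence = Sentence(CoordinateClauseComposite(
    clauses=[
        SimpleClause(VerbalPhrase("came"), NominalPhrase("I")),
        SimpleClause(VerbalPhrase("saw")),
    ],
    conjunction="and",
))
```

Optional members are modelled here as fields that default to `None`, which is one possible encoding of the optionality stated in the description.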

In our general natural language model, a complement 
is a phrase used to complete a predicate construction, and 
a head is a complement that plays the same grammatical 
role as the whole predicate construction. 

Our general natural language model supports nominal, 
verbal, adverbial, adjectival, and prepositional comple- 
ments. 

Nominal complements comprise nominal phrases, nom- 
inal composites, and nominal clauses. A nominal phrase 
consists of an optional determiner, a noun, and an op- 
tional set of complements. A nominal composite consists 
of an optional determiner, a set of nominal complements 
and an optional floating conjunction. A nominal clause 
consists of an optional determiner, an optional subor- 
dinating conjunction and a subordinate clause. Nouns 
comprise common nouns, proper nouns, and pronouns. 
Pronouns, in turn, reference nouns and proper nouns. 

Verbal complements comprise verbal phrases and ver- 
bal composites. A verbal phrase consists of a set of float- 
ing verbs and an optional floating preposition. A verbal 
composite consists of a set of verbal complements and an 
optional floating conjunction. 

Adverbial complements comprise adverbial phrases,
adverbial composites, and adverbial clauses. An
adverbial phrase consists of an adverb. An adverbial composite
consists of a set of adverbial complements and an
optional floating conjunction. An adverbial clause consists
of an optional subordinating conjunction and a
subordinate clause.

Adjectival complements comprise adjectival
composites and adjectival clauses. An adjectival composite
consists of a set of adjectival complements and an optional
floating conjunction. An adjectival clause consists of an
optional subordinating conjunction and a subordinate
clause.

Figure 3: ModelCC specification of the sentence and clause elements of our general natural language.

Figure 4: ModelCC specification of the phrase, head, and complement elements of our general natural language.

Figure 5: ModelCC specification of the verbal complement language elements in our general natural language.

Figure 6: ModelCC specification of the nominal complement language elements in our general natural language.

Figure 7: ModelCC specification of prepositional complement language elements in our general natural language.

Prepositional complements comprise prepositional 
phrases and prepositional composites. A prepositional
phrase consists of a floating preposition and a head. A 
prepositional composite consists of a set of prepositional 
complements and an optional floating conjunction. 

It should be noted that this general natural lan- 
guage embraces Romance languages such as Spanish, 
Portuguese, French, and Italian, as well as Germanic lan- 
guages such as English and German. 

B. ModelCC Specification of the ASM for Natural 
Languages 

In order to implement our general natural language
parser using ModelCC, we have to provide a
specification of the language ASM. This specification is provided
as a set of UML diagrams, as illustrated in Figures 3,
4, 5, 6, and 7. Adjectival complements and adverbial
complements can be specified like the nominal complements
in Figure 6. As can be observed from the figures, the
model-based specification of the general natural language
is easily obtained from the language description
requirements.

As the specified model is an abstract syntax model, it
does not correspond to any particular natural language.
The ASM is more like the Mentalese language postulated by the
Language of Thought Hypothesis [?]. In the next
subsection, we explain how different fully-functional natural
language parsers can be instantiated from this model by
defining additional language-specific constraints.

C. Specification of the Natural Language CSMs 

In order to implement a parser for a particular natural 
language, the ASM-CSM mapping has to be specified. A 
pattern matcher is assigned to each lexical component 
of the language model. For this purpose, ModelCC's 
@Pattern annotation allows the specification of custom 
pattern matchers that can consist of regular expressions, 
dictionary lookups, or any suitable heuristics. Such pat- 
tern matchers can easily be induced from the analysis of 
lexicons. 

ModelCC supports lexical ambiguities apart from syn- 
tactic ambiguities, so the specified pattern matchers can 
produce different, and even overlapping, sets of tokens 
from the analysis of the input string. 
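
The effect of overlapping pattern matchers can be sketched with a dictionary-lookup scanner. Everything here is hypothetical: the toy `LEXICON`, the word classes, and the functions `scan` and `count_sequences` are invented for illustration and do not reproduce ModelCC's @Pattern mechanism.

```python
import re
from typing import Dict, List, Set, Tuple

# Toy lexicon: a phrase may belong to several word classes, and a multi-word
# entry ("new york") overlaps with its single-word parts.
LEXICON: Dict[str, Set[str]] = {
    "i": {"Pronoun", "Noun"},
    "saw": {"Verb", "Noun"},
    "a": {"Determiner"},
    "new": {"Adjective"},
    "york": {"ProperNoun"},
    "new york": {"ProperNoun"},
}
WORD = re.compile(r"[A-Za-z]+")


def scan(text: str) -> List[Tuple[int, int, str]]:
    """Return every (start, end, word class) token over word positions."""
    words = [w.lower() for w in WORD.findall(text)]
    tokens = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase = " ".join(words[start:end])
            for word_class in sorted(LEXICON.get(phrase, ())):
                tokens.append((start, end, word_class))
    return tokens


def count_sequences(tokens: List[Tuple[int, int, str]], length: int) -> int:
    """Count token sequences that cover positions 0..length without gaps."""
    ways = [0] * (length + 1)
    ways[0] = 1
    for position in range(length):  # tokens always end after they start
        for start, end, _ in tokens:
            if start == position:
                ways[end] += ways[position]
    return ways[length]


tokens = scan("I saw a New York")
n_sequences = count_sequences(tokens, 5)
```

With this toy lexicon the five words yield eight tokens and eight gap-free token sequences; the number of sequences grows multiplicatively with each ambiguous word, which is why a compact lexical analysis graph is preferable to enumerating sequences.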

After specifying the pattern matchers for the
lexical components of the language, language-specific
constraints are assigned to syntactic components of the
language model. For this purpose, ModelCC's @Constraint
annotation allows the specification of methods that
evaluate whether a language element instance is valid or not.
These constraints can be automatically induced from the
analysis of linguistic corpora and, as explained in
Subsection III.B, these constraints can take into account
extensive context information, which can even include resolved
references between language elements. 
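
A constraint of this kind can be pictured as a predicate over a language element instance and its resolved reference. The sketch below is an assumption-laden illustration in Python, not ModelCC's @Constraint API: the `Noun` and `Pronoun` classes, the `number` feature, and the agreement rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Noun:
    text: str
    number: str  # "singular" or "plural"


@dataclass
class Pronoun:
    text: str
    number: str
    referent: Optional[Noun] = None  # resolved reference, if any


def agrees_with_referent(pronoun: Pronoun) -> bool:
    """Constraint: a resolved pronoun must agree in number with its referent."""
    if pronoun.referent is None:
        return True  # nothing to check until the reference is resolved
    return pronoun.number == pronoun.referent.number


pictures = Noun("pictures", "plural")
valid = agrees_with_referent(Pronoun("them", "plural", pictures))
invalid = agrees_with_referent(Pronoun("it", "singular", pictures))
```

A parse graph in which "it" refers to "pictures" would be rejected by this predicate, while the same pronoun with no resolved referent would pass.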

Finally, in order to produce a probabilistic parser, 
probability evaluators are assigned to the different language
constructions. For this purpose, ModelCC's
@Probability annotation allows the specification of the
probability evaluation for language elements. These 
probabilities can also be estimated from the analysis of 
linguistic corpora, although heuristics could also be used. 



D. An Example: Parsing an English Sentence 

We have implemented an English parser by specifying 
an ASM-CSM mapping from the general language ASM. 

We have defined pattern matchers that query
wiktionary.org to perform the lexical analysis. We have
assigned to the different lexemes and constructions
approximate probability values derived from the analysis
of the Google n-gram datasets.

As an example, we have parsed the sentence "I saw
a picture of New York". The lexical analysis graph for
this sentence represents 128 valid token sequences and is
shown in Figure 8. A set of valid parse graphs can be
obtained from this lexical analysis graph, and Figure 9
shows the correct parse tree.



V. CONCLUSIONS AND FUTURE WORK 

Natural languages suffer from ambiguities. A common 
approach to disambiguation consists of performing prob- 
abilistic scanning and probabilistic parsing. Such tech- 
niques present several drawbacks: wrong sequences of 
tokens may be produced, and only small amounts of con- 
text information are used. 

We have described ModelCC's support for probabilistic 
language models. ModelCC is a model-based parser gen- 
erator that supports lexical ambiguities, syntactic ambi- 
guities, and reference resolution. 

Also, we have demonstrated the application of
ModelCC to probabilistic parsing by providing a model-based
specification of a general natural language and an
English-language instantiation of it.

We plan to do research on the automatic induction 
of probabilistic language models, syntactic constraints, 
and semantic constraints from linguistic corpora. We 
also plan to do research on the application of alternative 
models for the representation of uncertainty to natural 
language parsing. 

Figure 8: Lexical analysis graph for the sentence "I saw a picture of New York".